Machine Translation versus Dictionary Term Translation - A Comparison for English-Japanese News Article Alignment

Authors

  • Nigel Collier
  • Hideki Hirakawa
  • Akira Kumano
Abstract

Bilingual news article alignment methods based on multilingual information retrieval have been shown to be successful for the automatic production of so-called noisy-parallel corpora. In this paper we compare the use of machine translation (MT) to the commonly used dictionary term lookup (DTL) method for Reuter news article alignment in English and Japanese. The results show the trade-off between improved lexical disambiguation provided by machine translation and extended synonym choice provided by dictionary term lookup and indicate that MT is superior to DTL only at medium and low recall levels. At high recall levels DTL has superior precision.

1 Introduction

In this paper we compare the effectiveness of full machine translation (MT) and simple dictionary term lookup (DTL) for the task of English-Japanese news article alignment using the vector space model from multilingual information retrieval. Matching texts depends essentially on lexical coincidence between the English text and the Japanese translation, and we see that the two methods show the trade-off between reduced transfer ambiguity in MT and increased synonymy in DTL.

Corpus-based approaches to natural language processing are now well established for tasks such as vocabulary and phrase acquisition, word sense disambiguation and pattern learning. The continued practical application of corpus-based methods is critically dependent on the availability of corpus resources. In machine translation we are concerned with the provision of bilingual knowledge and we have found that the types of language domains which users are interested in, such as news, current affairs and technology, are poorly represented in today's publicly available corpora. Our main area of interest is English-Japanese translation, but there are few clean parallel corpora available in large quantities. As a result we have looked at ways of automatically acquiring large amounts of parallel text for vocabulary acquisition.
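The vector space matching with dictionary term lookup described in the introduction can be illustrated with a minimal sketch. This is not the authors' system: the dictionary entries, the article names and the helper functions `dtl_translate`, `cosine` and `best_match` are all hypothetical, and raw term counts stand in for whatever term weighting the paper actually uses. It only shows how keeping every dictionary equivalent (the increased synonymy of DTL) feeds into a cosine score based on lexical coincidence.

```python
import math
from collections import Counter

# Toy bilingual dictionary: hypothetical entries, for illustration only.
TOY_DICT = {
    "銀行": ["bank"],
    "金利": ["interest", "rate"],
    "首相": ["premier", "prime", "minister"],
    "選挙": ["election", "poll"],
}


def dtl_translate(ja_terms):
    """Dictionary term lookup: keep every listed English equivalent.

    Terms missing from the dictionary are silently dropped.
    """
    english = []
    for term in ja_terms:
        english.extend(TOY_DICT.get(term, []))
    return english


def cosine(terms_a, terms_b):
    """Cosine similarity between two bags of words (raw term counts)."""
    va, vb = Counter(terms_a), Counter(terms_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(ja_terms, english_articles):
    """Return the name of the highest-scoring English article and all scores."""
    query = dtl_translate(ja_terms)
    scores = {name: cosine(query, terms)
              for name, terms in english_articles.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    articles = {
        "finance": ["bank", "raises", "interest", "rate"],
        "politics": ["prime", "minister", "calls", "election"],
    }
    name, scores = best_match(["銀行", "金利"], articles)
    print(name, scores)
```

In an MT-based variant, `dtl_translate` would be replaced by the output of a full translation system, which commits to one rendering per term instead of listing all equivalents.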
The World Wide Web and other Internet resources provide a potentially valuable source of parallel texts. Newswire companies for example publish news articles in various languages and various domains every day. We can expect a coincidence of content in these collections of text, but the degree of parallelism is likely to be less than is the case for texts such as the United Nations and parliamentary proceedings. Nevertheless, we can expect a coincidence of vocabulary, in the case of names of people and places, organisations and events. This time-sensitive bilingual vocabulary is valuable for machine translation and makes a significant difference to user satisfaction by improving the comprehensibility of …

Similar resources

Machine Translation vs. Dictionary Term Translation

Bilingual news article alignment methods based on multilingual information retrieval have been shown to be successful for the automatic production of so-called noisy-parallel corpora. In this paper we compare the use of machine translation (MT) to the commonly used dictionary term lookup (DTL) method for Reuter news article alignment in English and Japanese. The results show the trade-off betw...

Statistical machine translation: from single word models to alignment templates

In this work, new approaches for machine translation using statistical methods are described. In addition to the standard source-channel approach to statistical machine translation, a more general approach based on the maximum entropy principle is presented. Various methods for computing single-word alignments using statistical or heuristic models are described. Various smoothing techniques, me...

Automatic Alignment of Japanese and English Newspaper Articles using an MT System and a Bilingual Company Name Dictionary

One of the crucial parts of any corpus-based machine translation system is a large-scale bilingual corpus that is aligned at various levels, such as the sentence and phrase levels. This kind of corpus, however, is not easy to obtain, and accordingly, there is a great need for an efficient construction method. We approach this problem by integrating two large monolingual corpora in two different...

CASICT-DCU Neural Machine Translation Systems for WMT17

We participated in the WMT 2016 shared news translation task on English ↔ Chinese language pair. Our systems are based on the encoder-decoder neural machine translation model with the attention mechanism. We employ the Gated Recurrent Unit (GRU) with the linear associative connection to build deep encoder and address the unknown words with the dictionary replace approach. The dictionaries are e...

A Japanese-to-English Statistical Machine Translation System for Technical Documents

This thesis addresses a Japanese-to-English statistical machine translation (SMT) system for technical documents. Machine translation (MT) is a promising solution for growing translation needs. Japanese-to-English MT is one of the most difficult language pairs due to their large lexical and syntactic differences. This thesis work focuses on patents as the most demanded technical documents that ...

Publication date: 1998